AITopics | music caption

Collaborating Authors

music caption

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

MusiScene: Leveraging MU-LLaMA for Scene Imagination and Enhanced Video Background Music Generation

Izzati, Fathinah, Li, Xinyue, Wu, Yuxuan, Xia, Gus

arXiv.org Artificial IntelligenceJul-9-2025

Humans can imagine various atmospheres and settings when listening to music, envisioning movie scenes that complement each piece. For example, slow, melancholic music might evoke scenes of heartbreak, while upbeat melodies suggest celebration. This paper explores whether a Music Language Model, e.g. MU-LLaMA, can perform a similar task, called Music Scene Imagination (MSI), which requires cross-modal information from video and music to train. To improve upon existing music captioning models which focusing solely on musical elements, we introduce MusiScene, a music captioning model designed to imagine scenes that complement each music. In this paper, (1) we construct a large-scale video-audio caption dataset with 3,371 pairs, (2) we finetune Music Understanding LLaMA for the MSI task to create MusiScene, and (3) we conduct comprehensive evaluations and prove that our MusiScene is more capable of generating contextually relevant captions compared to MU-LLaMA. We leverage the generated MSI captions to enhance Video Background Music Generation (VBMG) from text.

large language model, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2507.05894

Genre: Research Report > New Finding (0.47)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.95)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.48)

Add feedback

Cross-Modal Learning for Music-to-Music-Video Description Generation

Mao, Zhuoyuan, Zhao, Mengjie, Wu, Qiyu, Zhong, Zhi, Liao, Wei-Hsiang, Wakaki, Hiromi, Mitsufuji, Yuki

arXiv.org Artificial IntelligenceMar-14-2025

Music-to-music-video generation is a challenging task due to the intrinsic differences between the music and video modalities. The advent of powerful text-to-video diffusion models has opened a promising pathway for music-video (MV) generation by first addressing the music-to-MV description task and subsequently leveraging these models for video generation. In this study, we focus on the MV description generation task and propose a comprehensive pipeline encompassing training data construction and multimodal model fine-tuning. We fine-tune existing pre-trained multimodal models on our newly constructed music-to-MV description dataset based on the Music4All dataset, which integrates both musical and visual information. Our experimental results demonstrate that music representations can be effectively mapped to textual domains, enabling the generation of meaningful MV description directly from music inputs. We also identify key components in the dataset construction pipeline that critically impact the quality of MV description and highlight specific musical attributes that warrant greater focus for improved MV description generation.

dataset, music, mv description, (15 more...)

arXiv.org Artificial Intelligence

2503.1119

Country:

Europe > Austria > Vienna (0.14)
South America > Uruguay > Maldonado > Maldonado (0.04)
South America > Brazil > Rio de Janeiro > Niterói (0.04)
(8 more...)

Genre: Research Report > New Finding (0.87)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.98)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.30)

Add feedback

Can Impressions of Music be Extracted from Thumbnail Images?

Harada, Takashi, Motomitsu, Takehiro, Hayashi, Katsuhiko, Sakai, Yusuke, Kamigaito, Hidetaka

arXiv.org Artificial IntelligenceJan-5-2025

In recent years, there has been a notable increase in research on machine learning models for music retrieval and generation systems that are capable of taking natural language sentences as inputs. However, there is a scarcity of large-scale publicly available datasets, consisting of music data and their corresponding natural language descriptions known as music captions. In particular, non-musical information such as suitable situations for listening to a track and the emotions elicited upon listening is crucial for describing music. This type of information is underrepresented in existing music caption datasets due to the challenges associated with extracting it directly from music data. To address this issue, we propose a method for generating music caption data that incorporates non-musical aspects inferred from music thumbnail images, and validated the effectiveness of our approach through human evaluations. Additionally, we created a dataset with approximately 360,000 captions containing non-musical aspects. Leveraging this dataset, we trained a music retrieval model and demonstrated its effectiveness in music retrieval tasks through evaluation.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2501.02511

Country: Asia > Japan (0.14)

Genre: Research Report (0.82)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.48)

Add feedback

M2M-Gen: A Multimodal Framework for Automated Background Music Generation in Japanese Manga Using Large Language Models

Sharma, Megha, Haseeb, Muhammad Taimoor, Xia, Gus, Tsuruoka, Yoshimasa

arXiv.org Artificial IntelligenceOct-13-2024

This paper introduces M2M Gen, a multi modal framework for generating background music tailored to Japanese manga. The key challenges in this task are the lack of an available dataset or a baseline. To address these challenges, we propose an automated music generation pipeline that produces background music for an input manga book. Initially, we use the dialogues in a manga to detect scene boundaries and perform emotion classification using the characters faces within a scene. Then, we use GPT4o to translate this low level scene information into a high level music directive. Conditioned on the scene information and the music directive, another instance of GPT 4o generates page level music captions to guide a text to music model. This produces music that is aligned with the mangas evolving narrative. The effectiveness of M2M Gen is confirmed through extensive subjective evaluations, showcasing its capability to generate higher quality, more relevant and consistent music that complements specific scenes when compared to our baselines.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2410.09928

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.16)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.15)

Genre:

Research Report > Experimental Study (0.69)
Research Report > New Finding (0.69)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Futga: Towards Fine-grained Music Understanding through Temporally-enhanced Generative Augmentation

Wu, Junda, Novack, Zachary, Namburi, Amit, Dai, Jiaheng, Dong, Hao-Wen, Xie, Zhouhang, Chen, Carol, McAuley, Julian

arXiv.org Artificial IntelligenceJul-29-2024

Existing music captioning methods are limited to generating concise global descriptions of short music clips, which fail to capture fine-grained musical characteristics and time-aware musical changes. To address these limitations, we propose FUTGA, a model equipped with fined-grained music understanding capabilities through learning from generative augmentation with temporal compositions. We leverage existing music caption datasets and large language models (LLMs) to synthesize fine-grained music captions with structural descriptions and time boundaries for full-length songs. Augmented by the proposed synthetic dataset, FUTGA is enabled to identify the music's temporal changes at key transition points and their musical functions, as well as generate detailed descriptions for each music segment. We further introduce a full-length music caption dataset generated by FUTGA, as the augmentation of the MusicCaps and the Song Describer datasets. We evaluate the automatically generated captions on several downstream tasks, including music generation and retrieval. The experiments demonstrate the quality of the generated captions and the better performance in various downstream tasks achieved by the proposed music captioning approach. Our code and datasets can be found in \href{https://huggingface.co/JoshuaW1997/FUTGA}{\textcolor{blue}{https://huggingface.co/JoshuaW1997/FUTGA}}.

caption, music, music caption, (14 more...)

arXiv.org Artificial Intelligence

2407.20445

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
North America > United States > California > San Diego County > San Diego (0.04)

Genre: Research Report (0.40)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.89)

Add feedback